by Peter de Blanc + ChatGPT Deep Research
Posted to Adarie (www.adarie.com) on April 23, 2025
Content License: Creative Commons CC0 (No Rights Reserved)
In classical game theory, randomized (mixed) strategies are often essential to avoid exploitation. A deterministic (pure) strategy can be countered by a clever opponent, whereas a mixed strategy keeps opponents indifferent. The canonical example is Rock-Paper-Scissors: any fixed action (e.g. always playing Rock) is immediately exploitable (the opponent can always play Paper), while the unique Nash equilibrium is to play each move with 1/3 probability (When Stochastic Policies Are Better Than Deterministic Ones | by Wouter van Heeswijk, PhD | TDS Archive | Medium). In general, Nash’s theorem tells us that in zero-sum games an optimal strategy may require randomizing among actions – ensuring no opponent can predict and exploit a pattern (Game Theory Unit 5 – Mixed Strategies and Randomization - Fiveable). Von Neumann’s minimax theorem guarantees the existence of such an unexploitable mixed strategy for zero-sum games. Formally, a minimax (Nash) strategy maximizes your guaranteed payoff against a worst-case opponent; any deviation or deterministic commitment can, in principle, yield the opponent a higher payoff (Approximate Exploitability: Learning a Best Response). Exploitability is the standard measure of deviation from equilibrium – it quantifies how much more payoff an adversary’s best response can achieve against your policy than against an equilibrium strategy (Approximate Exploitability: Learning a Best Response). A strategy with zero exploitability is at equilibrium (unexploitable), whereas a highly exploitable strategy invites a counter-strategy that can greatly reduce its performance.
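To make the definition concrete, here is a minimal sketch in Python with NumPy (illustrative only, not taken from the cited papers) that computes the exploitability of a mixed strategy in Rock-Paper-Scissors: the best-responding opponent’s gain against our policy relative to the game value, which is zero for this symmetric zero-sum game.

```python
import numpy as np

# Payoff matrix for the row player in Rock-Paper-Scissors.
# A[i, j] = our payoff when we play i and the opponent plays j
# (order: Rock, Paper, Scissors).
A = np.array([
    [ 0, -1,  1],   # Rock     vs R, P, S
    [ 1,  0, -1],   # Paper    vs R, P, S
    [-1,  1,  0],   # Scissors vs R, P, S
])

GAME_VALUE = 0.0  # value of symmetric zero-sum RPS

def exploitability(p):
    """How much a best-responding opponent gains against policy p
    relative to the game value."""
    p = np.asarray(p, dtype=float)
    worst_case_payoff = (p @ A).min()   # opponent picks the column that hurts us most
    return GAME_VALUE - worst_case_payoff

print(exploitability([1/3, 1/3, 1/3]))    # 0.0  -> unexploitable (the Nash mixture)
print(exploitability([0.5, 0.25, 0.25]))  # 0.25 -> mildly exploitable
print(exploitability([1.0, 0.0, 0.0]))    # 1.0  -> always-Rock is maximally exploitable
```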
Applying these principles to AI agents: a greedy policy (temperature T = 0) that always selects the single top-rated move is a deterministic strategy. Unless the agent’s policy is already at a perfect equilibrium (which in complex games it generally is not), theory guarantees that some opponent strategy can exploit its predictable choices. In practice, this has been vividly demonstrated in adversarial AI-vs-AI settings. For example, recent work showed that the Go-playing AI KataGo, even at superhuman strength, had specific blind-spot patterns that a trained adversary could exploit reliably (Even Superhuman Go AIs Have Surprising Failure Modes — LessWrong). The adversary learned to steer the game into positions where KataGo’s policy (which was effectively deterministic when picking its highest-value moves) consistently made suboptimal decisions, leading to defeat (Even Superhuman Go AIs Have Surprising Failure Modes — LessWrong). Notably, this adversary could beat KataGo 94% of the time while itself losing to human amateurs (Adversarial Policies Beat Superhuman Go AIs) – a clear sign of exploiting KataGo’s particular deterministic policy, not of superior all-around play. This kind of non-transitive outcome (Adversary beats KataGo; KataGo beats humans; humans beat Adversary) underscores how a strong but deterministic policy can be systematically countered (Adversarial Policies Beat Superhuman Go AIs). Theoretically, if KataGo’s policy were a true Nash-equilibrium strategy, such an exploiter would not exist. But in practice, self-play agents can converge to a policy that ranks as very strong on average yet is not an equilibrium – thus still highly exploitable by a tailored opponent (Adversarial Policies Beat Superhuman Go AIs). In short, deterministic play makes an agent a fixed target. An opponent that can model or learn your policy (even as a black-box by observation) can eventually anticipate every move. Game-theoretic security demands some randomness; without it, a policy-only agent risks being “memorized” and defeated by adversaries that prepare specifically for its predictable patterns (When Stochastic Policies Are Better Than Deterministic Ones | by Wouter van Heeswijk, PhD | TDS Archive | Medium) (Game Theory Unit 5 – Mixed Strategies and Randomization - Fiveable).
Temperature sampling introduces controlled randomness in action selection. At temperature T = 1, an agent samples actions according to its learned probability distribution (the policy’s output probabilities), whereas T = 0 corresponds to always taking the argmax action. Intermediate temperatures (0 < T < 1) or more sophisticated sampling schemes (e.g. epsilon-greedy) blend exploitation with randomness. The effect of increasing T is to flatten the policy distribution, giving lower-rated moves a chance to be played; decreasing T sharpens the distribution toward the single best move. This leads to a fundamental trade-off between move quality and unpredictability.
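As a concrete illustration, a minimal NumPy sketch of temperature sampling (the logit values are made up for the example) rescales the policy’s logits before the softmax: T = 1 recovers the learned distribution, T → 0 approaches the argmax, and larger T flattens the distribution.

```python
import numpy as np

def sample_action(logits, temperature, rng=None):
    """Sample an action index from temperature-scaled policy logits."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    if temperature <= 1e-8:                    # treat T ~ 0 as greedy argmax
        return int(np.argmax(logits))
    scaled = logits / temperature
    scaled = scaled - scaled.max()             # stabilize the softmax
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.8, 0.5, -1.0]                 # hypothetical move scores
print(sample_action(logits, temperature=0.0))  # always the top-rated move (index 0)
print(sample_action(logits, temperature=1.0))  # usually 0 or 1, occasionally the rest
print(sample_action(logits, temperature=5.0))  # close to uniform over all four moves
```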
Immediate Strength vs. Optimal Counter-Play: At T = 0, the agent is greedily maximizing its evaluated win-rate each move, which often yields the highest performance against opponents similar to those it was trained on. Indeed, if the opponent makes no special effort to exploit our agent, deterministic greedy play should perform at least as well as (usually better than) a stochastic policy that sometimes chooses second-best moves. However, T = 0 also means no randomness – the agent will respond the same way every time in a given situation. An adversary can exploit this consistency by forcing the agent into a disadvantageous sequence it always falls for. From a game-theoretic view, pure greedy play ignores the minimax criterion (it’s optimizing assuming a fixed distribution of opponent responses, not a worst-case response). It may choose moves that yield a high win probability against standard play but have a hidden drawback that a worst-case opponent could capitalize on. In contrast, a higher-temperature (stochastic) policy injects uncertainty: the opponent now faces a mixed strategy. This can thwart an adversary’s plans – if the opponent’s exploitative strategy relies on the agent making a particular predictable move, a stochastic policy that sometimes “refuses” that move can break the scheme. The opponent is forced to hedge against multiple possible actions, often reducing the effectiveness of a specific exploit. The cost, of course, is that sometimes the agent will not play its top-scoring move, potentially giving up small margins in typical play.
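The consistency problem can be seen in miniature with a toy simulation (an illustrative sketch, not drawn from the cited work): an opponent that has modeled the agent’s policy always plays the best response to it. A greedy agent whose hypothetical learned scores slightly favor Rock loses every round, while a T = 1 agent over the same scores comes out nearly even.

```python
import numpy as np

rng = np.random.default_rng(0)

# Row-player payoffs for Rock-Paper-Scissors.
A = np.array([[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]])

scores = np.array([1.2, 1.0, 1.0])   # hypothetical learned scores: Rock slightly preferred

def make_policy(scores, temperature):
    """Greedy distribution for T ~ 0, Boltzmann distribution otherwise."""
    if temperature <= 1e-8:
        p = np.zeros_like(scores)
        p[np.argmax(scores)] = 1.0
        return p
    z = np.exp(scores / temperature - (scores / temperature).max())
    return z / z.sum()

def average_payoff(policy, n_rounds=100_000):
    """Opponent has modeled `policy` and always plays its best response."""
    best_response = int(np.argmin(policy @ A))          # the column that hurts us most
    our_moves = rng.choice(3, size=n_rounds, p=policy)
    return A[our_moves, best_response].mean()

print(average_payoff(make_policy(scores, temperature=0.0)))  # -1.0   : greedy Rock is countered every round
print(average_payoff(make_policy(scores, temperature=1.0)))  # ~ -0.07: the stochastic agent is nearly even
```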
Theoretical Insights on Randomized Policies: In equilibrium analysis, an optimal mixed strategy often sacrifices a tiny amount of expected reward to vastly reduce exploitability. For instance, if two moves are nearly equal in value, playing them with some probability each yields almost the same average outcome but prevents an opponent from focusing on countering just one. In formal terms, the maximin solution might require randomizing among all moves that give the same minimax value. If our agent’s policy has a set of near-optimal actions, sampling among them can approximate an equilibrium by making the agent’s behavior unpredictable while not significantly hurting its expectation. On the other hand, if the policy’s top action is much better than the rest, then a truly robust (minimax) strategy would not randomize away from that action – so injecting randomness in such cases does lower the agent’s theoretical max-min value. The challenge is that a learned policy’s estimates of “nearly equal moves” might be imperfect; it could strongly prefer one action while a slightly weaker move would actually protect against a rare but severe counter. Thus, some deliberate entropy in the policy can insure against the agent’s own evaluation errors or biases. In fact, recent reinforcement learning research in two-player games shows that adding entropy regularization (which effectively keeps the policy distribution broad) can lead to lower-exploitability strategies (Reevaluating Policy Gradient Methods for Imperfect-Information Games). Higher entropy (analogous to higher-temperature stochastic policies) acts as a regularizer that discourages the agent from over-committing to a particular line of play that an opponent could exploit. Empirically, policy-gradient methods with larger entropy bonuses produce agents that, while slightly less optimal against standard opponents, are harder for a best-response opponent to beat (Reevaluating Policy Gradient Methods for Imperfect-Information Games). This connects to the idea of quantal response equilibria in game theory, where each player chooses probabilistically (e.g. via a Boltzmann distribution over payoffs) – such softmax policies form equilibria that are smoother than pure Nash, trading off some performance for robustness. In summary, using T > 0 (stochastic play) can dramatically reduce worst-case exploitability, at the cost of a mild dip in average-case performance. A fully deterministic policy might achieve the highest Elo against non-adversarial opponents, but a mildly stochastic policy can achieve a secure performance guarantee by avoiding deterministic weaknesses (Adversarial Policies Beat Superhuman Go AIs) (When Stochastic Policies Are Better Than Deterministic Ones | by Wouter van Heeswijk, PhD | TDS Archive | Medium).
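The trade-off can be made quantitative in a toy setting. The sketch below (again illustrative; the “typical” opponent mixture and the resulting action values are made up) forms Boltzmann policies at several temperatures and reports both the average payoff against that typical opponent and the exploitability against a best-responding one.

```python
import numpy as np

# Row-player payoffs for Rock-Paper-Scissors.
A = np.array([[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]])

q_typical = np.array([0.40, 0.35, 0.25])   # hypothetical "standard" opponent mixture
values = A @ q_typical                     # agent's estimated action values vs that opponent

def boltzmann(values, temperature):
    """Greedy distribution for T ~ 0, softmax over values otherwise."""
    if temperature <= 1e-8:
        p = np.zeros_like(values)
        p[np.argmax(values)] = 1.0
        return p
    z = np.exp(values / temperature - (values / temperature).max())
    return z / z.sum()

for T in (0.0, 0.1, 1.0):
    p = boltzmann(values, T)
    avg = p @ A @ q_typical        # average-case performance vs the typical opponent
    expl = -(p @ A).min()          # exploitability (the game value of RPS is 0)
    print(f"T={T:<4} avg={avg:+.3f}  exploitability={expl:.3f}")

# T=0.0  avg=+0.150  exploitability=1.000   (best average, worst worst-case)
# T=0.1  avg=+0.111  exploitability=0.754
# T=1.0  avg=+0.012  exploitability=0.085   (near-uniform, hard to exploit)
```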
A practical and theoretically intuitive approach is to use a state- or phase-dependent temperature – in effect, randomize more in parts of the game where being unpredictable is valuable and costly exploitation is possible, and play more deterministically in critical phases where accuracy outweighs the benefits of randomness. Many games naturally divide into phases that differ in strategic freedom and risk. For example, in Go or Chess, the opening phase has a wide array of roughly equal alternatives (many moves lead to playable positions), whereas the endgame or late-game often has a single clearly best move (or a narrow set of winning moves). From a game-theoretic perspective, if multiple actions yield nearly the same long-term value (as is often the case in the opening when the game is balanced), a player can mix between those actions without sacrificing outcome – this is essentially what a Nash equilibrium would do if there is no unique best opening move. Randomizing the opening moves prevents an adversary from preparing a specific “book” sequence to trap the agent. It forces the opponent to consider many possible lines, essentially approaching an equilibrium distribution over openings. In contrast, later in the game, the stakes of each move are higher and the evaluation landscape is sharper; here one move might clearly dominate others (e.g. a tactical sequence that wins material, or a move that avoids a losing blunder). In such cases, deviating from the best move even 5% of the time could directly lose the game, which no amount of unpredictability can justify. Thus, a principled temperature schedule would start higher (more stochastic) in the early, strategically rich phases and then gradually lower the temperature as the game progresses to ensure near-optimal play in decisive moments. This idea has been observed in top AI systems: AlphaGo Zero and AlphaZero, during self-play training, sampled actions with temperature T = 1 for the first 30 moves of each game to encourage exploring different openings, then used T → 0 (effectively greedy play) afterward (AlphaGo Zero: Opening Study : r/baduk). The result was a diverse opening repertoire learned by the AI. For actual play against humans or other AIs, DeepMind reported using a bit of randomness in casual online games but returning to mostly deterministic play in formal matches (AlphaGo Zero: Opening Study : r/baduk). The rationale is exactly as above – early randomness to prevent opponents from exploiting predictability or using memorized counters, but later determinism to execute the highest-quality moves when unpredictability yields diminishing returns (AlphaGo Zero: Opening Study : r/baduk). Even human grandmasters employ analogous strategies: they vary their opening choices between games (a mixed strategy at the start) so rivals can’t prepare a single anti-opening, but in sharp tactical positions they will consistently choose the objectively best continuation.
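A schedule in that spirit is easy to write down. In the sketch below, the 30-move cutoff mirrors the AlphaGo Zero self-play setting described above, while the exponential decay rate and the temperature floor are arbitrary illustrative choices.

```python
def temperature_schedule(move_number, opening_moves=30, opening_T=1.0, floor_T=0.05):
    """Phase-dependent temperature: exploratory in the opening, near-greedy later.

    The 30-move cutoff mirrors AlphaGo Zero's self-play scheme; the decay rate
    and the floor value are arbitrary illustrative choices.
    """
    if move_number < opening_moves:
        return opening_T
    # Decay smoothly toward (but never exactly to) deterministic play.
    decay = 0.9 ** (move_number - opening_moves)
    return max(floor_T, opening_T * decay)

print([round(temperature_schedule(m), 3) for m in (0, 29, 30, 35, 60)])
# [1.0, 1.0, 1.0, 0.59, 0.05]
```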
It’s worth noting that phase-dependent temperature is a heuristic; formal analyses usually consider the overall game-theoretic solution rather than splitting by phases. However, this approach aligns with the concept of a behavioral strategy in extensive-form games – i.e. a policy that can randomize independently at different states. In a zero-sum game, an equilibrium can indeed be a behavior strategy that randomizes at some information sets (or states) and is deterministic at others. Intuitively, if a state (like an endgame position) offers exactly one superior action, an equilibrium strategy would deterministically take it; if a state (like an opening) offers several equally valued options, an equilibrium might randomize among them. Thus varying the “temperature” by phase is an attempt to approximate this ideal: high entropy when the game has many equally good choices, and low entropy when one choice clearly leads to the best guaranteed outcome.
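One way to approximate such a behavior strategy with a single policy network is to set the temperature per state from the gap between the top two action values: stay stochastic when several moves look nearly equal, play sharply when one move dominates. The following is a hypothetical heuristic in that spirit (the interpolation formula and the `gap_scale` constant are invented for illustration), not a method from the cited sources.

```python
import numpy as np

def state_dependent_temperature(action_values, base_T=1.0, gap_scale=0.05):
    """Heuristic: temperature shrinks as the top action pulls away from the rest.

    `gap_scale` (an arbitrary illustrative constant) sets how large a value gap
    must be before play becomes effectively deterministic.
    """
    v = np.sort(np.asarray(action_values, dtype=float))[::-1]
    gap = v[0] - v[1] if len(v) > 1 else np.inf
    # Smoothly interpolate: gap = 0 -> base_T, large gap -> close to 0.
    return base_T * gap_scale / (gap_scale + gap)

print(state_dependent_temperature([0.52, 0.51, 0.50]))  # ~0.83: near-equal moves, stay stochastic
print(state_dependent_temperature([0.90, 0.40, 0.10]))  # ~0.09: one clear best move, play it
```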
From a broader perspective, the challenge of choosing the right temperature ties into robust reinforcement learning and game-theoretic planning. A truly robust policy in a two-player adversarial game is one that maximizes worst-case performance – mathematically, π* = argmax_π min_π′ E[R(π, π′)], where π′ ranges over opponent policies. The solution to this max–min problem is an equilibrium policy (for zero-sum games, the Nash equilibrium), which, as discussed, may be stochastic. In practice, if we have a fixed policy network, we can’t magically make it the Nash strategy, but we can try to approximate an equilibrium by appropriate randomization or training adjustments. One principled approach from multi-agent RL is to train a population of policies and compute a mixed strategy over them – for example, Policy-Space Response Oracles (PSRO) and similar algorithms iteratively find best responses and mix them, converging toward Nash distributions. AlphaStar (DeepMind’s StarCraft II agent) employed a league of agents, including “main” agents and specialized “exploiter” agents, and in the end effectively played a mixture over many policies to cover its weaknesses (Adversarial Policies Beat Superhuman Go AIs). This is a more elaborate version of using temperature: instead of one policy network made stochastic, the final agent is a meta-policy that randomizes between different strategies (some aggressive, some conservative, etc.), making it much harder to exploit. The theoretical justification is the same – by incorporating diverse strategies, the agent approaches a Nash-like coverage of the strategy space, forcing opponents to deal with all of them. In contrast, simply sampling from a single fixed network’s distribution is a limited form of randomization, but it is much better than none. Indeed, even without retraining, one can sometimes improve robustness by tuning the sampling method: e.g. upper confidence or randomized regret minimization methods choose actions with probability biased by both value and uncertainty, aiming to avoid being too predictable while mostly playing strong moves. There is also a line of work on bounded-rationality equilibria (like quantal response equilibrium) where each player’s policy is a Boltzmann response to the other’s; if one were to choose a temperature that best protects against an adversary assuming you have a certain rationality level, it could be analyzed in that framework. However, against a truly worst-case (fully rational) exploiter, the only safe policy is the Nash strategy itself – which brings us back to mixing in proportion to true game values.
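As a small illustration of the population idea (a generic meta-solver sketch, not DeepMind’s league code), fictitious play on an empirical payoff matrix between candidate policies yields an approximate Nash mixture that the agent can sample from at the start of each game; the head-to-head numbers below are invented to give a non-transitive, RPS-like relationship.

```python
import numpy as np

def fictitious_play_mixture(payoffs, iterations=5000):
    """Approximate a Nash mixture for the row player of a zero-sum payoff matrix.

    `payoffs[i, j]` is the row player's average payoff when row policy i
    meets column policy j (e.g. estimated from head-to-head games).
    """
    n_rows, n_cols = payoffs.shape
    row_counts = np.zeros(n_rows)
    col_counts = np.zeros(n_cols)
    row_counts[0] = col_counts[0] = 1.0          # arbitrary starting plays
    for _ in range(iterations):
        # Each side best-responds to the opponent's empirical mixture so far.
        row_br = int(np.argmax(payoffs @ col_counts))
        col_br = int(np.argmin(row_counts @ payoffs))
        row_counts[row_br] += 1
        col_counts[col_br] += 1
    return row_counts / row_counts.sum()

# Hypothetical head-to-head results among three trained policies
# (a non-transitive, RPS-like relationship).
payoffs = np.array([[ 0.0, -0.6,  0.4],
                    [ 0.6,  0.0, -0.5],
                    [-0.4,  0.5,  0.0]])

mixture = fictitious_play_mixture(payoffs)
print(mixture.round(3))                               # approximate Nash mixture over the three policies
rng = np.random.default_rng()
policy_to_use = rng.choice(len(mixture), p=mixture)   # pick one policy for this game
```

For this particular matrix the exact equilibrium mixture works out to roughly [0.33, 0.27, 0.40], which the empirical frequencies approach; PSRO-style methods repeatedly grow the policy set with new best responses and re-solve this meta-game.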
In summary, theoretical and empirical research strongly supports using stochastic policies to balance strength and exploitability. A deterministic (T = 0) policy tends to achieve higher reward against regular opposition but can be memorized and outwitted by adversaries. Allowing some randomness (T > 0) moves the policy closer to a game-theoretic mixed strategy, which can dramatically improve worst-case outcomes by masking the agent’s weaknesses. The optimal amount of randomness likely depends on the context: in early-game or symmetric situations, more randomness can be introduced with minimal loss (and indeed may be required for equilibrium play), whereas in critical tactical spots, one should play the best move (low T). While there is no single formula for the perfect temperature schedule, the guiding principle is clear: to avoid exploitation, do not be overly predictable. Theoretical frameworks (like minimax equilibrium analysis) justify injecting entropy into the policy as a means of approximating an unexploitable strategy (Adversarial Policies Beat Superhuman Go AIs) (Reevaluating Policy Gradient Methods for Imperfect-Information Games). In practice, agents achieve this via tuned temperature parameters, entropy regularization in training, or mixtures of policies. The result is a more robust agent that, even if slightly less optimal in the average case, can withstand adversarial opponents in an online setting without catastrophic breakdowns. As one commentary succinctly put it: “Predictable is not always good” (When Stochastic Policies Are Better Than Deterministic Ones | by Wouter van Heeswijk, PhD | TDS Archive | Medium) – randomness, used judiciously, is a powerful shield against being outsmarted.
Sources: Key theoretical insights were drawn from game theory and multi-agent RL literature, including classical results on mixed-strategy Nash equilibria and minimax optimality (When Stochastic Policies Are Better Than Deterministic Ones | by Wouter van Heeswijk, PhD | TDS Archive | Medium) (Game Theory Unit 5 – Mixed Strategies and Randomization - Fiveable), formal definitions of exploitability (Approximate Exploitability: Learning a Best Response), and recent studies on the exploitability of deep RL agents (Adversarial Policies Beat Superhuman Go AIs) (Even Superhuman Go AIs Have Surprising Failure Modes — LessWrong). Notably, the vulnerability of deterministic Go policies was demonstrated by Wang et al. (ICML 2023) (Even Superhuman Go AIs Have Surprising Failure Modes — LessWrong) (Adversarial Policies Beat Superhuman Go AIs), and experiments have shown that higher-entropy policies can reduce exploitability (Reevaluating Policy Gradient Methods for Imperfect-Information Games). Practical strategies like AlphaZero’s adaptive temperature and AlphaStar’s league training illustrate how these principles are applied in state-of-the-art systems (AlphaGo Zero: Opening Study : r/baduk) (Adversarial Policies Beat Superhuman Go AIs).